Our Problem

Our Audience

Our audience is the talent acquisition team at IBM (International Business Machines Corporation), a global technology company with a large and varied workforce. Central to the team's concerns is employee attrition. Our objective is to provide guidance that helps them recruit high-quality candidates with a low likelihood of turnover. Optimizing hiring in this way should improve IBM's profitability in both the short and long term.

Our Question

Our central question is how to maximize both the caliber and the overall productivity of the workforce while guarding against premature turnover. At the core of this question is the need to maximize the profit derived from each hire: recruiting individuals who depart quickly produces a negative return on investment, particularly once training costs are counted.

Hence, we must screen candidates carefully to avoid scenarios where early attrition undermines the bottom line.

Our Narrative

IBM’s workforce spans many fields, yet recent trends reveal a concerning pattern of premature departures and the profit setbacks that follow. The root of the issue is the acquisition team’s emphasis on short-term productivity at the expense of employee retention.

Similar to university admissions, where an institution like the University of Michigan might pass over an exceptional applicant it doubts will actually enroll, IBM must take a nuanced approach. We propose a strategy that balances maximizing immediate output against the long-term goal of retaining top talent.

Our solution involves screening candidates not only for their potential contributions to IBM but also for their propensity to remain with the company. By prioritizing individuals who demonstrate both high potential and a commitment to long-term engagement, we mitigate the risk of negative profit margins associated with frequent turnover.

In essence, our tailored models enable IBM to navigate the seemingly paradoxical challenge of selecting candidates who are both highly productive and likely to stay. By investing in employees who align with the company’s long-term vision, IBM can secure greater profitability in the years ahead.

How We Quantify Our Work

To quantify the impact of our work, we assign monetary values to our final outcomes, condensing worker productivity into a single, interpretable variable. This variable combines tenure, job level, engagement, overtime commitment, performance rating, and compensation, each weighted to reflect worker quality. We also quantify the financial cost of attrition, so the results are straightforward to interpret.
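As a sketch of what such a composite could look like, the function below is purely illustrative: the column names come from the attrition data set, but the weights, the overtime bonus, and the compensation offset are placeholder assumptions, not the calibrated values used in our analysis.

```r
# Hypothetical dollar-value score for a hire; all weights are placeholders.
productivity_value <- function(df,
                               w_tenure   = 1000,  # per year at the company
                               w_level    = 2000,  # per job level
                               w_involve  = 500,   # per point of job involvement
                               w_overtime = 1500,  # bonus if the employee works overtime
                               w_rating   = 1000)  # per rating point above the minimum (3)
{
  w_tenure   * df$YearsAtCompany +
  w_level    * df$JobLevel +
  w_involve  * df$JobInvolvement +
  w_overtime * (df$OverTime == "Yes") +
  w_rating   * (df$PerformanceRating - 3) -
  12 * df$MonthlyIncome   # annual compensation offsets the value produced
}
```

In practice the weights would be tuned so that the resulting dollar figures line up with observed productivity and replacement costs.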

When the model is applied to test data, its performance can be measured and clearly communicated; transparency is central to our reporting. We also exclude from the training data any variable that can only be observed after hiring, so the model remains usable by the IBM acquisition team for evaluating future candidates.

While our model serves as a valuable tool, it’s important to acknowledge its limitations. It’s not infallible and should be complemented with other considerations such as resume details and interview performance. Our recommendations are based on upfront data, like marital status or proximity to the office, and should be integrated with holistic hiring practices.

It’s essential to note that our model may not always identify the most qualified candidates, as it prioritizes retention probability over immediate output. This strategic focus might initially impact short-term productivity. However, the long-term benefits of reduced attrition and sustained workforce stability outweigh these potential shortfalls.

Data Exploration

Read in Data

We read in the data set and take a quick look at it.

employee <- read.csv("attrition.csv")
str(employee)
## 'data.frame':    1470 obs. of  35 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : chr  "Yes" "No" "Yes" "No" ...
##  $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : chr  "Sales" "Research & Development" "Research & Development" "Research & Development" ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : chr  "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  1 2 4 5 7 8 10 11 12 13 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : chr  "Female" "Male" "Male" "Female" ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : chr  "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : chr  "Single" "Married" "Single" "Married" ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ Over18                  : chr  "Y" "Y" "Y" "Y" ...
##  $ OverTime                : chr  "Yes" "No" "Yes" "Yes" ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...
summary(employee)
##       Age         Attrition         BusinessTravel       DailyRate     
##  Min.   :18.00   Length:1470        Length:1470        Min.   : 102.0  
##  1st Qu.:30.00   Class :character   Class :character   1st Qu.: 465.0  
##  Median :36.00   Mode  :character   Mode  :character   Median : 802.0  
##  Mean   :36.92                                         Mean   : 802.5  
##  3rd Qu.:43.00                                         3rd Qu.:1157.0  
##  Max.   :60.00                                         Max.   :1499.0  
##   Department        DistanceFromHome   Education     EducationField    
##  Length:1470        Min.   : 1.000   Min.   :1.000   Length:1470       
##  Class :character   1st Qu.: 2.000   1st Qu.:2.000   Class :character  
##  Mode  :character   Median : 7.000   Median :3.000   Mode  :character  
##                     Mean   : 9.193   Mean   :2.913                     
##                     3rd Qu.:14.000   3rd Qu.:4.000                     
##                     Max.   :29.000   Max.   :5.000                     
##  EmployeeCount EmployeeNumber   EnvironmentSatisfaction    Gender         
##  Min.   :1     Min.   :   1.0   Min.   :1.000           Length:1470       
##  1st Qu.:1     1st Qu.: 491.2   1st Qu.:2.000           Class :character  
##  Median :1     Median :1020.5   Median :3.000           Mode  :character  
##  Mean   :1     Mean   :1024.9   Mean   :2.722                             
##  3rd Qu.:1     3rd Qu.:1555.8   3rd Qu.:4.000                             
##  Max.   :1     Max.   :2068.0   Max.   :4.000                             
##    HourlyRate     JobInvolvement    JobLevel       JobRole         
##  Min.   : 30.00   Min.   :1.00   Min.   :1.000   Length:1470       
##  1st Qu.: 48.00   1st Qu.:2.00   1st Qu.:1.000   Class :character  
##  Median : 66.00   Median :3.00   Median :2.000   Mode  :character  
##  Mean   : 65.89   Mean   :2.73   Mean   :2.064                     
##  3rd Qu.: 83.75   3rd Qu.:3.00   3rd Qu.:3.000                     
##  Max.   :100.00   Max.   :4.00   Max.   :5.000                     
##  JobSatisfaction MaritalStatus      MonthlyIncome    MonthlyRate   
##  Min.   :1.000   Length:1470        Min.   : 1009   Min.   : 2094  
##  1st Qu.:2.000   Class :character   1st Qu.: 2911   1st Qu.: 8047  
##  Median :3.000   Mode  :character   Median : 4919   Median :14236  
##  Mean   :2.729                      Mean   : 6503   Mean   :14313  
##  3rd Qu.:4.000                      3rd Qu.: 8379   3rd Qu.:20462  
##  Max.   :4.000                      Max.   :19999   Max.   :26999  
##  NumCompaniesWorked    Over18            OverTime         PercentSalaryHike
##  Min.   :0.000      Length:1470        Length:1470        Min.   :11.00    
##  1st Qu.:1.000      Class :character   Class :character   1st Qu.:12.00    
##  Median :2.000      Mode  :character   Mode  :character   Median :14.00    
##  Mean   :2.693                                            Mean   :15.21    
##  3rd Qu.:4.000                                            3rd Qu.:18.00    
##  Max.   :9.000                                            Max.   :25.00    
##  PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
##  Min.   :3.000     Min.   :1.000            Min.   :80    Min.   :0.0000  
##  1st Qu.:3.000     1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000  
##  Median :3.000     Median :3.000            Median :80    Median :1.0000  
##  Mean   :3.154     Mean   :2.712            Mean   :80    Mean   :0.7939  
##  3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000  
##  Max.   :4.000     Max.   :4.000            Max.   :80    Max.   :3.0000  
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Min.   :1.000   Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Median :3.000   Median : 5.000  
##  Mean   :11.28     Mean   :2.799         Mean   :2.761   Mean   : 7.008  
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000  
##  Max.   :40.00     Max.   :6.000         Max.   :4.000   Max.   :40.000  
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.000          Max.   :17.000

Graph Attrition

We make a simple bar plot of the attrition variable.

library(ggplot2)
ggplot(employee, aes(x = factor(Attrition), fill = factor(Attrition))) +
  geom_bar(color = "black") +
  labs(title = "Distribution of Attrition",
       x = "",
       y = "Count") +
  scale_x_discrete(labels = c("No Attrition", "Attrition")) +
  # Set fill colors manually; guide = "none" hides the redundant legend
  # (guide = FALSE was deprecated in ggplot2 3.3.4)
  scale_fill_manual(values = c("darkgrey", "white"), guide = "none") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.line = element_line(color = "black"))

Clean Data

We drop the Over18 variable, which is constant, and convert the character variables to factors. We then take another look at the updated data.

employee$Attrition <- as.factor(employee$Attrition)
employee$BusinessTravel <- as.factor(employee$BusinessTravel)
employee$Department <- as.factor(employee$Department)
employee$EducationField <- as.factor(employee$EducationField)
employee$Gender <- as.factor(employee$Gender)
employee$JobRole <- as.factor(employee$JobRole)
employee$MaritalStatus <- as.factor(employee$MaritalStatus)
employee$Over18 <- NULL
employee$OverTime <- as.factor(employee$OverTime)
str(employee)
## 'data.frame':    1470 obs. of  34 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  1 2 4 5 7 8 10 11 12 13 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...
summary(employee)
##       Age        Attrition            BusinessTravel   DailyRate     
##  Min.   :18.00   No :1233   Non-Travel       : 150   Min.   : 102.0  
##  1st Qu.:30.00   Yes: 237   Travel_Frequently: 277   1st Qu.: 465.0  
##  Median :36.00              Travel_Rarely    :1043   Median : 802.0  
##  Mean   :36.92                                       Mean   : 802.5  
##  3rd Qu.:43.00                                       3rd Qu.:1157.0  
##  Max.   :60.00                                       Max.   :1499.0  
##                                                                      
##                   Department  DistanceFromHome   Education    
##  Human Resources       : 63   Min.   : 1.000   Min.   :1.000  
##  Research & Development:961   1st Qu.: 2.000   1st Qu.:2.000  
##  Sales                 :446   Median : 7.000   Median :3.000  
##                               Mean   : 9.193   Mean   :2.913  
##                               3rd Qu.:14.000   3rd Qu.:4.000  
##                               Max.   :29.000   Max.   :5.000  
##                                                               
##           EducationField EmployeeCount EmployeeNumber   EnvironmentSatisfaction
##  Human Resources : 27    Min.   :1     Min.   :   1.0   Min.   :1.000          
##  Life Sciences   :606    1st Qu.:1     1st Qu.: 491.2   1st Qu.:2.000          
##  Marketing       :159    Median :1     Median :1020.5   Median :3.000          
##  Medical         :464    Mean   :1     Mean   :1024.9   Mean   :2.722          
##  Other           : 82    3rd Qu.:1     3rd Qu.:1555.8   3rd Qu.:4.000          
##  Technical Degree:132    Max.   :1     Max.   :2068.0   Max.   :4.000          
##                                                                                
##     Gender      HourlyRate     JobInvolvement    JobLevel    
##  Female:588   Min.   : 30.00   Min.   :1.00   Min.   :1.000  
##  Male  :882   1st Qu.: 48.00   1st Qu.:2.00   1st Qu.:1.000  
##               Median : 66.00   Median :3.00   Median :2.000  
##               Mean   : 65.89   Mean   :2.73   Mean   :2.064  
##               3rd Qu.: 83.75   3rd Qu.:3.00   3rd Qu.:3.000  
##               Max.   :100.00   Max.   :4.00   Max.   :5.000  
##                                                              
##                       JobRole    JobSatisfaction  MaritalStatus MonthlyIncome  
##  Sales Executive          :326   Min.   :1.000   Divorced:327   Min.   : 1009  
##  Research Scientist       :292   1st Qu.:2.000   Married :673   1st Qu.: 2911  
##  Laboratory Technician    :259   Median :3.000   Single  :470   Median : 4919  
##  Manufacturing Director   :145   Mean   :2.729                  Mean   : 6503  
##  Healthcare Representative:131   3rd Qu.:4.000                  3rd Qu.: 8379  
##  Manager                  :102   Max.   :4.000                  Max.   :19999  
##  (Other)                  :215                                                 
##   MonthlyRate    NumCompaniesWorked OverTime   PercentSalaryHike
##  Min.   : 2094   Min.   :0.000      No :1054   Min.   :11.00    
##  1st Qu.: 8047   1st Qu.:1.000      Yes: 416   1st Qu.:12.00    
##  Median :14236   Median :2.000                 Median :14.00    
##  Mean   :14313   Mean   :2.693                 Mean   :15.21    
##  3rd Qu.:20462   3rd Qu.:4.000                 3rd Qu.:18.00    
##  Max.   :26999   Max.   :9.000                 Max.   :25.00    
##                                                                 
##  PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
##  Min.   :3.000     Min.   :1.000            Min.   :80    Min.   :0.0000  
##  1st Qu.:3.000     1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000  
##  Median :3.000     Median :3.000            Median :80    Median :1.0000  
##  Mean   :3.154     Mean   :2.712            Mean   :80    Mean   :0.7939  
##  3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000  
##  Max.   :4.000     Max.   :4.000            Max.   :80    Max.   :3.0000  
##                                                                           
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Min.   :1.000   Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Median :3.000   Median : 5.000  
##  Mean   :11.28     Mean   :2.799         Mean   :2.761   Mean   : 7.008  
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000  
##  Max.   :40.00     Max.   :6.000         Max.   :4.000   Max.   :40.000  
##                                                                          
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.000          Max.   :17.000      
## 
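The column-by-column conversions above can also be written in one pass; a compact equivalent that converts every remaining character column to a factor:

```r
# Identify the character columns, then factor them all at once
char_cols <- sapply(employee, is.character)
employee[char_cols] <- lapply(employee[char_cols], as.factor)
```

This produces the same result as the explicit as.factor() calls, provided Over18 has already been dropped.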

View Data

We run a for loop to look at histograms of all of our numerical variables. This gives us a better idea of the makeup of the data.

library(ggplot2)

numerical_vars <- names(employee)[sapply(employee, is.numeric)]

for (var in numerical_vars) {
  print(
    # .data[[var]] replaces the deprecated aes_string(); bins = 30 is set
    # explicitly to silence the repeated stat_bin() message
    ggplot(employee, aes(x = .data[[var]])) +
      geom_histogram(bins = 30, fill = "lightblue", color = "black") +
      labs(title = paste("Histogram of", var), x = var, y = "Frequency") +
      theme_minimal()
  )
}
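An alternative to looping is to reshape the numeric columns into long form and facet, producing a single figure instead of two dozen separate plots. A sketch, assuming the tidyr package is installed:

```r
library(ggplot2)
library(tidyr)

# Stack every numeric column into (variable, value) pairs, then facet
long <- pivot_longer(employee[numerical_vars], cols = everything(),
                     names_to = "variable", values_to = "value")

ggplot(long, aes(x = value)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal()
```

The free scales matter here because the variables range from single digits (JobLevel) to tens of thousands (MonthlyRate).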

We notice that EmployeeCount and StandardHours are identical for every observation and that EmployeeNumber is merely an identifier, so we delete all three. We also notice that PerformanceRating takes only two values (3 and 4) in this data, effectively a binary variable, so we convert it to a factor.

employee$EmployeeCount <- NULL
employee$StandardHours <- NULL
employee$EmployeeNumber <- NULL
employee$PerformanceRating <- as.factor(employee$PerformanceRating)
str(employee)
## 'data.frame':    1470 obs. of  31 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 2 2 2 1 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

We now take a look at a correlation matrix of all of our remaining numerical variables.

numerical_vars <- employee[, sapply(employee, is.numeric)]

correlation_matrix <- cor(numerical_vars)

print(correlation_matrix)
##                                   Age    DailyRate DistanceFromHome
## Age                       1.000000000  0.010660943     -0.001686120
## DailyRate                 0.010660943  1.000000000     -0.004985337
## DistanceFromHome         -0.001686120 -0.004985337      1.000000000
## Education                 0.208033731 -0.016806433      0.021041826
## EnvironmentSatisfaction   0.010146428  0.018354854     -0.016075327
## HourlyRate                0.024286543  0.023381422      0.031130586
## JobInvolvement            0.029819959  0.046134874      0.008783280
## JobLevel                  0.509604228  0.002966335      0.005302731
## JobSatisfaction          -0.004891877  0.030571008     -0.003668839
## MonthlyIncome             0.497854567  0.007707059     -0.017014445
## MonthlyRate               0.028051167 -0.032181602      0.027472864
## NumCompaniesWorked        0.299634758  0.038153434     -0.029250804
## PercentSalaryHike         0.003633585  0.022703677      0.040235377
## RelationshipSatisfaction  0.053534720  0.007846031      0.006557475
## StockOptionLevel          0.037509712  0.042142796      0.044871999
## TotalWorkingYears         0.680380536  0.014514739      0.004628426
## TrainingTimesLastYear    -0.019620819  0.002452543     -0.036942234
## WorkLifeBalance          -0.021490028 -0.037848051     -0.026556004
## YearsAtCompany            0.311308770 -0.034054768      0.009507720
## YearsInCurrentRole        0.212901056  0.009932015      0.018844999
## YearsSinceLastPromotion   0.216513368 -0.033228985      0.010028836
## YearsWithCurrManager      0.202088602 -0.026363178      0.014406048
##                             Education EnvironmentSatisfaction   HourlyRate
## Age                       0.208033731             0.010146428  0.024286543
## DailyRate                -0.016806433             0.018354854  0.023381422
## DistanceFromHome          0.021041826            -0.016075327  0.031130586
## Education                 1.000000000            -0.027128313  0.016774829
## EnvironmentSatisfaction  -0.027128313             1.000000000 -0.049856956
## HourlyRate                0.016774829            -0.049856956  1.000000000
## JobInvolvement            0.042437634            -0.008277598  0.042860641
## JobLevel                  0.101588886             0.001211699 -0.027853486
## JobSatisfaction          -0.011296117            -0.006784353 -0.071334624
## MonthlyIncome             0.094960677            -0.006259088 -0.015794304
## MonthlyRate              -0.026084197             0.037599623 -0.015296750
## NumCompaniesWorked        0.126316560             0.012594323  0.022156883
## PercentSalaryHike        -0.011110941            -0.031701195 -0.009061986
## RelationshipSatisfaction -0.009118377             0.007665384  0.001330453
## StockOptionLevel          0.018422220             0.003432158  0.050263399
## TotalWorkingYears         0.148279697            -0.002693070 -0.002333682
## TrainingTimesLastYear    -0.025100241            -0.019359308 -0.008547685
## WorkLifeBalance           0.009819189             0.027627295 -0.004607234
## YearsAtCompany            0.069113696             0.001457549 -0.019581616
## YearsInCurrentRole        0.060235554             0.018007460 -0.024106220
## YearsSinceLastPromotion   0.054254334             0.016193606 -0.026715586
## YearsWithCurrManager      0.069065378            -0.004998723 -0.020123200
##                          JobInvolvement     JobLevel JobSatisfaction
## Age                         0.029819959  0.509604228   -0.0048918771
## DailyRate                   0.046134874  0.002966335    0.0305710078
## DistanceFromHome            0.008783280  0.005302731   -0.0036688392
## Education                   0.042437634  0.101588886   -0.0112961167
## EnvironmentSatisfaction    -0.008277598  0.001211699   -0.0067843526
## HourlyRate                  0.042860641 -0.027853486   -0.0713346244
## JobInvolvement              1.000000000 -0.012629883   -0.0214759103
## JobLevel                   -0.012629883  1.000000000   -0.0019437080
## JobSatisfaction            -0.021475910 -0.001943708    1.0000000000
## MonthlyIncome              -0.015271491  0.950299913   -0.0071567424
## MonthlyRate                -0.016322079  0.039562951    0.0006439169
## NumCompaniesWorked          0.015012413  0.142501124   -0.0556994260
## PercentSalaryHike          -0.017204572 -0.034730492    0.0200020394
## RelationshipSatisfaction    0.034296821  0.021641511   -0.0124535932
## StockOptionLevel            0.021522640  0.013983911    0.0106902261
## TotalWorkingYears          -0.005533182  0.782207805   -0.0201850727
## TrainingTimesLastYear      -0.015337826 -0.018190550   -0.0057793350
## WorkLifeBalance            -0.014616593  0.037817746   -0.0194587102
## YearsAtCompany             -0.021355427  0.534738687   -0.0038026279
## YearsInCurrentRole          0.008716963  0.389446733   -0.0023047852
## YearsSinceLastPromotion    -0.024184292  0.353885347   -0.0182135678
## YearsWithCurrManager        0.025975808  0.375280608   -0.0276562139
##                          MonthlyIncome   MonthlyRate NumCompaniesWorked
## Age                        0.497854567  0.0280511671        0.299634758
## DailyRate                  0.007707059 -0.0321816015        0.038153434
## DistanceFromHome          -0.017014445  0.0274728635       -0.029250804
## Education                  0.094960677 -0.0260841972        0.126316560
## EnvironmentSatisfaction   -0.006259088  0.0375996229        0.012594323
## HourlyRate                -0.015794304 -0.0152967496        0.022156883
## JobInvolvement            -0.015271491 -0.0163220791        0.015012413
## JobLevel                   0.950299913  0.0395629510        0.142501124
## JobSatisfaction           -0.007156742  0.0006439169       -0.055699426
## MonthlyIncome              1.000000000  0.0348136261        0.149515216
## MonthlyRate                0.034813626  1.0000000000        0.017521353
## NumCompaniesWorked         0.149515216  0.0175213534        1.000000000
## PercentSalaryHike         -0.027268586 -0.0064293459       -0.010238309
## RelationshipSatisfaction   0.025873436 -0.0040853293        0.052733049
## StockOptionLevel           0.005407677 -0.0343228302        0.030075475
## TotalWorkingYears          0.772893246  0.0264424712        0.237638590
## TrainingTimesLastYear     -0.021736277  0.0014668806       -0.066054072
## WorkLifeBalance            0.030683082  0.0079631575       -0.008365685
## YearsAtCompany             0.514284826 -0.0236551067       -0.118421340
## YearsInCurrentRole         0.363817667 -0.0128148744       -0.090753934
## YearsSinceLastPromotion    0.344977638  0.0015667995       -0.036813892
## YearsWithCurrManager       0.344078883 -0.0367459053       -0.110319155
##                          PercentSalaryHike RelationshipSatisfaction
## Age                            0.003633585             0.0535347197
## DailyRate                      0.022703677             0.0078460310
## DistanceFromHome               0.040235377             0.0065574746
## Education                     -0.011110941            -0.0091183767
## EnvironmentSatisfaction       -0.031701195             0.0076653835
## HourlyRate                    -0.009061986             0.0013304528
## JobInvolvement                -0.017204572             0.0342968206
## JobLevel                      -0.034730492             0.0216415105
## JobSatisfaction                0.020002039            -0.0124535932
## MonthlyIncome                 -0.027268586             0.0258734361
## MonthlyRate                   -0.006429346            -0.0040853293
## NumCompaniesWorked            -0.010238309             0.0527330486
## PercentSalaryHike              1.000000000            -0.0404900811
## RelationshipSatisfaction      -0.040490081             1.0000000000
## StockOptionLevel               0.007527748            -0.0459524907
## TotalWorkingYears             -0.020608488             0.0240542918
## TrainingTimesLastYear         -0.005221012             0.0024965264
## WorkLifeBalance               -0.003279636             0.0196044057
## YearsAtCompany                -0.035991262             0.0193667869
## YearsInCurrentRole            -0.001520027            -0.0151229149
## YearsSinceLastPromotion       -0.022154313             0.0334925021
## YearsWithCurrManager          -0.011985248            -0.0008674968
##                          StockOptionLevel TotalWorkingYears
## Age                           0.037509712       0.680380536
## DailyRate                     0.042142796       0.014514739
## DistanceFromHome              0.044871999       0.004628426
## Education                     0.018422220       0.148279697
## EnvironmentSatisfaction       0.003432158      -0.002693070
## HourlyRate                    0.050263399      -0.002333682
## JobInvolvement                0.021522640      -0.005533182
## JobLevel                      0.013983911       0.782207805
## JobSatisfaction               0.010690226      -0.020185073
## MonthlyIncome                 0.005407677       0.772893246
## MonthlyRate                  -0.034322830       0.026442471
## NumCompaniesWorked            0.030075475       0.237638590
## PercentSalaryHike             0.007527748      -0.020608488
## RelationshipSatisfaction     -0.045952491       0.024054292
## StockOptionLevel              1.000000000       0.010135969
## TotalWorkingYears             0.010135969       1.000000000
## TrainingTimesLastYear         0.011274070      -0.035661571
## WorkLifeBalance               0.004128730       0.001007646
## YearsAtCompany                0.015058008       0.628133155
## YearsInCurrentRole            0.050817873       0.460364638
## YearsSinceLastPromotion       0.014352185       0.404857759
## YearsWithCurrManager          0.024698227       0.459188397
##                          TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## Age                               -0.019620819    -0.021490028    0.311308770
## DailyRate                          0.002452543    -0.037848051   -0.034054768
## DistanceFromHome                  -0.036942234    -0.026556004    0.009507720
## Education                         -0.025100241     0.009819189    0.069113696
## EnvironmentSatisfaction           -0.019359308     0.027627295    0.001457549
## HourlyRate                        -0.008547685    -0.004607234   -0.019581616
## JobInvolvement                    -0.015337826    -0.014616593   -0.021355427
## JobLevel                          -0.018190550     0.037817746    0.534738687
## JobSatisfaction                   -0.005779335    -0.019458710   -0.003802628
## MonthlyIncome                     -0.021736277     0.030683082    0.514284826
## MonthlyRate                        0.001466881     0.007963158   -0.023655107
## NumCompaniesWorked                -0.066054072    -0.008365685   -0.118421340
## PercentSalaryHike                 -0.005221012    -0.003279636   -0.035991262
## RelationshipSatisfaction           0.002496526     0.019604406    0.019366787
## StockOptionLevel                   0.011274070     0.004128730    0.015058008
## TotalWorkingYears                 -0.035661571     0.001007646    0.628133155
## TrainingTimesLastYear              1.000000000     0.028072207    0.003568666
## WorkLifeBalance                    0.028072207     1.000000000    0.012089185
## YearsAtCompany                     0.003568666     0.012089185    1.000000000
## YearsInCurrentRole                -0.005737504     0.049856498    0.758753737
## YearsSinceLastPromotion           -0.002066536     0.008941249    0.618408865
## YearsWithCurrManager              -0.004095526     0.002759440    0.769212425
##                          YearsInCurrentRole YearsSinceLastPromotion
## Age                             0.212901056             0.216513368
## DailyRate                       0.009932015            -0.033228985
## DistanceFromHome                0.018844999             0.010028836
## Education                       0.060235554             0.054254334
## EnvironmentSatisfaction         0.018007460             0.016193606
## HourlyRate                     -0.024106220            -0.026715586
## JobInvolvement                  0.008716963            -0.024184292
## JobLevel                        0.389446733             0.353885347
## JobSatisfaction                -0.002304785            -0.018213568
## MonthlyIncome                   0.363817667             0.344977638
## MonthlyRate                    -0.012814874             0.001566800
## NumCompaniesWorked             -0.090753934            -0.036813892
## PercentSalaryHike              -0.001520027            -0.022154313
## RelationshipSatisfaction       -0.015122915             0.033492502
## StockOptionLevel                0.050817873             0.014352185
## TotalWorkingYears               0.460364638             0.404857759
## TrainingTimesLastYear          -0.005737504            -0.002066536
## WorkLifeBalance                 0.049856498             0.008941249
## YearsAtCompany                  0.758753737             0.618408865
## YearsInCurrentRole              1.000000000             0.548056248
## YearsSinceLastPromotion         0.548056248             1.000000000
## YearsWithCurrManager            0.714364762             0.510223636
##                          YearsWithCurrManager
## Age                              0.2020886024
## DailyRate                       -0.0263631782
## DistanceFromHome                 0.0144060484
## Education                        0.0690653783
## EnvironmentSatisfaction         -0.0049987226
## HourlyRate                      -0.0201232002
## JobInvolvement                   0.0259758079
## JobLevel                         0.3752806078
## JobSatisfaction                 -0.0276562139
## MonthlyIncome                    0.3440788833
## MonthlyRate                     -0.0367459053
## NumCompaniesWorked              -0.1103191554
## PercentSalaryHike               -0.0119852485
## RelationshipSatisfaction        -0.0008674968
## StockOptionLevel                 0.0246982266
## TotalWorkingYears                0.4591883971
## TrainingTimesLastYear           -0.0040955260
## WorkLifeBalance                  0.0027594402
## YearsAtCompany                   0.7692124251
## YearsInCurrentRole               0.7143647616
## YearsSinceLastPromotion          0.5102236358
## YearsWithCurrManager             1.0000000000

We see that the only numerical variables that are heavily correlated are monthly income and job level, with a correlation of about \(0.95\) (an \(R^2\) of roughly \(0.90\)). This relationship makes sense, and since the correlation is not a perfect \(1\) and both variables seem very relevant to our question, we choose to keep them both.
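This kind of check can also be automated rather than read off the full matrix. A small sketch, assuming `employee` is the same data frame the matrix above was computed from; the \(0.9\) cutoff is our own judgment call:

```r
# Flag numeric variable pairs whose absolute correlation exceeds a cutoff.
num_cols <- sapply(employee, is.numeric)
cors <- cor(employee[, num_cols])
cors[upper.tri(cors, diag = TRUE)] <- NA        # keep each pair only once
high <- which(abs(cors) > 0.9, arr.ind = TRUE)  # 0.9 cutoff is a judgment call
data.frame(var1 = rownames(cors)[high[, 1]],
           var2 = colnames(cors)[high[, 2]],
           r    = round(cors[high], 3))
# Per the matrix above, this should flag only MonthlyIncome / JobLevel (r ~ 0.95).
```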

Make Variable to Predict

We decide that we want to predict on our data before any hire is made. This means that we have to drop variables whose values can only be known after a hire, such as job satisfaction and years at company.
Before we do this, however, we create a variable to predict. We call it employee quality. "Quality" represents the amount of money an employee makes our company daily, which is why we subtract monthly income divided by \(30\). Any variable that is a positive indicator of job performance or output increases employee quality. Research guided how we allocated the weight of each variable; for instance, increased job involvement from employees has been shown to lead to large increases in profit for companies.
Admittedly, this metric is quite arbitrary. It is most definitely not a perfect representation of profit per employee, as that is likely impossible to capture in a single number: many factors cannot be measured, and some things, such as interactions between co-workers, cannot be reduced to one value. At any rate, it provides us a benchmark that certainly has some significance.
After this is done, we look at a histogram of our new quality variable. Notice that it is roughly normally distributed, lending support to its validity as a measurement of quality. Also notice that the mean quality is \(-16\) dollars, suggesting that we are likely underestimating profit per worker.

# Quality: estimated dollars an employee adds per day
employee$Quality <- with(employee, (100 * YearsAtCompany / Age) + (40 * JobInvolvement) 
                         + (20 * JobLevel) + ifelse(OverTime == "Yes", 30, 0) + 
                           ifelse(PerformanceRating == 4, 150, 0) - (MonthlyIncome / 30))
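To make the weights concrete, here is a worked example for a hypothetical employee (all values invented for illustration): age 35, 7 years at the company, JobInvolvement 3, JobLevel 2, works overtime, PerformanceRating 3, and a $5,000 monthly income:

```r
# (100 * 7 / 35) + (40 * 3) + (20 * 2) + 30 + 0 - (5000 / 30)
#   = 20 + 120 + 40 + 30 - 166.67
#   ~ 43.33 dollars added per day
(100 * 7 / 35) + (40 * 3) + (20 * 2) + 30 + 0 - (5000 / 30)
```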

# Drop variables only knowable after a hire, plus those folded into Quality
drops <- c("YearsAtCompany", "YearsInCurrentRole", "JobInvolvement", "JobLevel",
           "OverTime", "PerformanceRating", "MonthlyIncome", "DailyRate",
           "MonthlyRate", "HourlyRate", "StockOptionLevel", "YearsSinceLastPromotion",
           "YearsWithCurrManager", "PercentSalaryHike", "TrainingTimesLastYear",
           "JobSatisfaction")
employee[drops] <- NULL
str(employee)
## 'data.frame':    1470 obs. of  16 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ Quality                 : num  4.87 119.41 60.33 97.28 31.81 ...
summary(employee)
##       Age        Attrition            BusinessTravel
##  Min.   :18.00   No :1233   Non-Travel       : 150  
##  1st Qu.:30.00   Yes: 237   Travel_Frequently: 277  
##  Median :36.00              Travel_Rarely    :1043  
##  Mean   :36.92                                      
##  3rd Qu.:43.00                                      
##  Max.   :60.00                                      
##                                                     
##                   Department  DistanceFromHome   Education    
##  Human Resources       : 63   Min.   : 1.000   Min.   :1.000  
##  Research & Development:961   1st Qu.: 2.000   1st Qu.:2.000  
##  Sales                 :446   Median : 7.000   Median :3.000  
##                               Mean   : 9.193   Mean   :2.913  
##                               3rd Qu.:14.000   3rd Qu.:4.000  
##                               Max.   :29.000   Max.   :5.000  
##                                                               
##           EducationField EnvironmentSatisfaction    Gender   
##  Human Resources : 27    Min.   :1.000           Female:588  
##  Life Sciences   :606    1st Qu.:2.000           Male  :882  
##  Marketing       :159    Median :3.000                       
##  Medical         :464    Mean   :2.722                       
##  Other           : 82    3rd Qu.:4.000                       
##  Technical Degree:132    Max.   :4.000                       
##                                                              
##                       JobRole     MaritalStatus NumCompaniesWorked
##  Sales Executive          :326   Divorced:327   Min.   :0.000     
##  Research Scientist       :292   Married :673   1st Qu.:1.000     
##  Laboratory Technician    :259   Single  :470   Median :2.000     
##  Manufacturing Director   :145                  Mean   :2.693     
##  Healthcare Representative:131                  3rd Qu.:4.000     
##  Manager                  :102                  Max.   :9.000     
##  (Other)                  :215                                    
##  RelationshipSatisfaction TotalWorkingYears WorkLifeBalance    Quality       
##  Min.   :1.000            Min.   : 0.00     Min.   :1.000   Min.   :-499.40  
##  1st Qu.:2.000            1st Qu.: 6.00     1st Qu.:2.000   1st Qu.: -79.66  
##  Median :3.000            Median :10.00     Median :3.000   Median :  16.32  
##  Mean   :2.712            Mean   :11.28     Mean   :2.761   Mean   : -16.04  
##  3rd Qu.:4.000            3rd Qu.:15.00     3rd Qu.:3.000   3rd Qu.:  70.68  
##  Max.   :4.000            Max.   :40.00     Max.   :4.000   Max.   : 306.63  
## 
hist(employee$Quality, 
     main = "Histogram of Employee Quality", 
     col = 'black', 
     border = 'white',
     xlab = "$ Added per Day",
     breaks = seq(-500, 350, by = 50))  # bins of width 50 on the x-axis

mean(employee$Quality)
## [1] -16.03619

Normalize Data

We have to normalize our data in order to run some of our models. This ensures that all of our variables carry equal weight; otherwise, variables with larger scales would have too much influence on our predictions.

employeedummy <- as.data.frame(model.matrix(~. -1, data=employee))

normalize <- function(x){
  (x - min(x))/(max(x) - min(x))
}

employee_n <- as.data.frame(lapply(employeedummy, normalize))
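As a quick check of the min-max scaling on a toy vector:

```r
normalize(c(10, 20, 40))
# 0.0000000 0.3333333 1.0000000
# Note: a constant column would yield 0/0 = NaN; the dummy-coded data here
# contains no constant columns, so this is not a concern.
```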

Prepare for Quality Prediction

We make Quality a binary variable for prediction: if a worker has an output of over \(50\) dollars a day, we consider them a quality worker. Once again, this is an arbitrary cutoff, but it gives us something to work with. It also ensures that even if our quality metric is prone to error, those rated "high quality" are still very likely to produce a profit.
We then look at a graph showing the main problem with IBM's hires: the quality workers are the ones most likely to quit. Although quality workers make up only about \(35\)% of the workforce, they account for roughly half of those who quit. This is the fundamental issue with our current hires. We are hiring for quality, but many of those hires are quitting. We need a more holistic hiring process, one that also searches for loyal workers.

employee_n$Quality <- employee$Quality
employee_n$Quality <- ifelse(employee_n$Quality > 50, 1, 0)
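A quick sanity check on the cutoff (the exact proportion can be read off the Quality mean in the summary output):

```r
# Share of workers labeled "quality" under the $50/day cutoff
prop.table(table(employee_n$Quality))  # roughly 65% / 35%
```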

library(ggplot2)
ggplot(employee_n, aes(x = factor(AttritionYes), fill = factor(Quality))) +
  geom_bar(position = "stack", color = "black") +
  labs(title = "Attrition vs Quality",
       x = "",
       y = "Count") +
  scale_x_discrete(labels = c("Not Attrition", "Attrition")) +
  scale_fill_manual(values = c("white", "black"), guide = "none") +
  facet_wrap(~factor(Quality, labels = c("Not Quality", "Quality"))) +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.line = element_line(color = "black"),
        strip.background = element_blank(),
        strip.text.x = element_text(size = 12, face = "bold"),
        plot.title = element_text(hjust = 0.5))

Prepare for Attrition Prediction

We make a second, very similar data set to predict attrition. We are careful not to include quality when predicting attrition, and vice versa. We are now prepared to build models predicting both attrition and quality. We take a quick look at our data sets to confirm.

attrition <- employee_n
attrition$Quality <- NULL
attrition$AttritionNo <- NULL
employee_n$AttritionNo <- NULL
employee_n$AttritionYes <- NULL
str(employee_n)
## 'data.frame':    1470 obs. of  29 variables:
##  $ Age                             : num  0.548 0.738 0.452 0.357 0.214 ...
##  $ BusinessTravelTravel_Frequently : num  0 1 0 1 0 1 0 0 1 0 ...
##  $ BusinessTravelTravel_Rarely     : num  1 0 1 0 1 0 1 1 0 1 ...
##  $ DepartmentResearch...Development: num  0 1 1 1 1 1 1 1 1 1 ...
##  $ DepartmentSales                 : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ DistanceFromHome                : num  0 0.25 0.0357 0.0714 0.0357 ...
##  $ Education                       : num  0.25 0 0.25 0.75 0 0.25 0.5 0 0.5 0.5 ...
##  $ EducationFieldLife.Sciences     : num  1 1 0 1 0 1 0 1 1 0 ...
##  $ EducationFieldMarketing         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EducationFieldMedical           : num  0 0 0 0 1 0 1 0 0 1 ...
##  $ EducationFieldOther             : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ EducationFieldTechnical.Degree  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EnvironmentSatisfaction         : num  0.333 0.667 1 1 0 ...
##  $ GenderMale                      : num  0 1 1 0 1 1 0 1 1 1 ...
##  $ JobRoleHuman.Resources          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ JobRoleLaboratory.Technician    : num  0 0 1 0 1 1 1 1 0 0 ...
##  $ JobRoleManager                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ JobRoleManufacturing.Director   : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ JobRoleResearch.Director        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ JobRoleResearch.Scientist       : num  0 1 0 1 0 0 0 0 0 0 ...
##  $ JobRoleSales.Executive          : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ JobRoleSales.Representative     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MaritalStatusMarried            : num  0 1 0 1 1 0 1 0 0 1 ...
##  $ MaritalStatusSingle             : num  1 0 1 0 0 1 0 0 1 0 ...
##  $ NumCompaniesWorked              : num  0.889 0.111 0.667 0.111 1 ...
##  $ RelationshipSatisfaction        : num  0 1 0.333 0.667 1 ...
##  $ TotalWorkingYears               : num  0.2 0.25 0.175 0.2 0.15 0.2 0.3 0.025 0.25 0.425 ...
##  $ WorkLifeBalance                 : num  0 0.667 0.667 0.667 0.667 ...
##  $ Quality                         : num  0 1 1 1 0 1 1 1 0 0 ...
summary(employee_n)
##       Age         BusinessTravelTravel_Frequently BusinessTravelTravel_Rarely
##  Min.   :0.0000   Min.   :0.0000                  Min.   :0.0000             
##  1st Qu.:0.2857   1st Qu.:0.0000                  1st Qu.:0.0000             
##  Median :0.4286   Median :0.0000                  Median :1.0000             
##  Mean   :0.4506   Mean   :0.1884                  Mean   :0.7095             
##  3rd Qu.:0.5952   3rd Qu.:0.0000                  3rd Qu.:1.0000             
##  Max.   :1.0000   Max.   :1.0000                  Max.   :1.0000             
##  DepartmentResearch...Development DepartmentSales  DistanceFromHome 
##  Min.   :0.0000                   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000                   1st Qu.:0.0000   1st Qu.:0.03571  
##  Median :1.0000                   Median :0.0000   Median :0.21429  
##  Mean   :0.6537                   Mean   :0.3034   Mean   :0.29259  
##  3rd Qu.:1.0000                   3rd Qu.:1.0000   3rd Qu.:0.46429  
##  Max.   :1.0000                   Max.   :1.0000   Max.   :1.00000  
##    Education      EducationFieldLife.Sciences EducationFieldMarketing
##  Min.   :0.0000   Min.   :0.0000              Min.   :0.0000         
##  1st Qu.:0.2500   1st Qu.:0.0000              1st Qu.:0.0000         
##  Median :0.5000   Median :0.0000              Median :0.0000         
##  Mean   :0.4782   Mean   :0.4122              Mean   :0.1082         
##  3rd Qu.:0.7500   3rd Qu.:1.0000              3rd Qu.:0.0000         
##  Max.   :1.0000   Max.   :1.0000              Max.   :1.0000         
##  EducationFieldMedical EducationFieldOther EducationFieldTechnical.Degree
##  Min.   :0.0000        Min.   :0.00000     Min.   :0.0000                
##  1st Qu.:0.0000        1st Qu.:0.00000     1st Qu.:0.0000                
##  Median :0.0000        Median :0.00000     Median :0.0000                
##  Mean   :0.3156        Mean   :0.05578     Mean   :0.0898                
##  3rd Qu.:1.0000        3rd Qu.:0.00000     3rd Qu.:0.0000                
##  Max.   :1.0000        Max.   :1.00000     Max.   :1.0000                
##  EnvironmentSatisfaction   GenderMale  JobRoleHuman.Resources
##  Min.   :0.0000          Min.   :0.0   Min.   :0.00000       
##  1st Qu.:0.3333          1st Qu.:0.0   1st Qu.:0.00000       
##  Median :0.6667          Median :1.0   Median :0.00000       
##  Mean   :0.5739          Mean   :0.6   Mean   :0.03537       
##  3rd Qu.:1.0000          3rd Qu.:1.0   3rd Qu.:0.00000       
##  Max.   :1.0000          Max.   :1.0   Max.   :1.00000       
##  JobRoleLaboratory.Technician JobRoleManager    JobRoleManufacturing.Director
##  Min.   :0.0000               Min.   :0.00000   Min.   :0.00000              
##  1st Qu.:0.0000               1st Qu.:0.00000   1st Qu.:0.00000              
##  Median :0.0000               Median :0.00000   Median :0.00000              
##  Mean   :0.1762               Mean   :0.06939   Mean   :0.09864              
##  3rd Qu.:0.0000               3rd Qu.:0.00000   3rd Qu.:0.00000              
##  Max.   :1.0000               Max.   :1.00000   Max.   :1.00000              
##  JobRoleResearch.Director JobRoleResearch.Scientist JobRoleSales.Executive
##  Min.   :0.00000          Min.   :0.0000            Min.   :0.0000        
##  1st Qu.:0.00000          1st Qu.:0.0000            1st Qu.:0.0000        
##  Median :0.00000          Median :0.0000            Median :0.0000        
##  Mean   :0.05442          Mean   :0.1986            Mean   :0.2218        
##  3rd Qu.:0.00000          3rd Qu.:0.0000            3rd Qu.:0.0000        
##  Max.   :1.00000          Max.   :1.0000            Max.   :1.0000        
##  JobRoleSales.Representative MaritalStatusMarried MaritalStatusSingle
##  Min.   :0.00000             Min.   :0.0000       Min.   :0.0000     
##  1st Qu.:0.00000             1st Qu.:0.0000       1st Qu.:0.0000     
##  Median :0.00000             Median :0.0000       Median :0.0000     
##  Mean   :0.05646             Mean   :0.4578       Mean   :0.3197     
##  3rd Qu.:0.00000             3rd Qu.:1.0000       3rd Qu.:1.0000     
##  Max.   :1.00000             Max.   :1.0000       Max.   :1.0000     
##  NumCompaniesWorked RelationshipSatisfaction TotalWorkingYears WorkLifeBalance 
##  Min.   :0.0000     Min.   :0.0000           Min.   :0.000     Min.   :0.0000  
##  1st Qu.:0.1111     1st Qu.:0.3333           1st Qu.:0.150     1st Qu.:0.3333  
##  Median :0.2222     Median :0.6667           Median :0.250     Median :0.6667  
##  Mean   :0.2992     Mean   :0.5707           Mean   :0.282     Mean   :0.5871  
##  3rd Qu.:0.4444     3rd Qu.:1.0000           3rd Qu.:0.375     3rd Qu.:0.6667  
##  Max.   :1.0000     Max.   :1.0000           Max.   :1.000     Max.   :1.0000  
##     Quality      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3483  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
summary(attrition)
##       Age          AttritionYes    BusinessTravelTravel_Frequently
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000                 
##  1st Qu.:0.2857   1st Qu.:0.0000   1st Qu.:0.0000                 
##  Median :0.4286   Median :0.0000   Median :0.0000                 
##  Mean   :0.4506   Mean   :0.1612   Mean   :0.1884                 
##  3rd Qu.:0.5952   3rd Qu.:0.0000   3rd Qu.:0.0000                 
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000                 
##  BusinessTravelTravel_Rarely DepartmentResearch...Development DepartmentSales 
##  Min.   :0.0000              Min.   :0.0000                   Min.   :0.0000  
##  1st Qu.:0.0000              1st Qu.:0.0000                   1st Qu.:0.0000  
##  Median :1.0000              Median :1.0000                   Median :0.0000  
##  Mean   :0.7095              Mean   :0.6537                   Mean   :0.3034  
##  3rd Qu.:1.0000              3rd Qu.:1.0000                   3rd Qu.:1.0000  
##  Max.   :1.0000              Max.   :1.0000                   Max.   :1.0000  
##  DistanceFromHome    Education      EducationFieldLife.Sciences
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000             
##  1st Qu.:0.03571   1st Qu.:0.2500   1st Qu.:0.0000             
##  Median :0.21429   Median :0.5000   Median :0.0000             
##  Mean   :0.29259   Mean   :0.4782   Mean   :0.4122             
##  3rd Qu.:0.46429   3rd Qu.:0.7500   3rd Qu.:1.0000             
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000             
##  EducationFieldMarketing EducationFieldMedical EducationFieldOther
##  Min.   :0.0000          Min.   :0.0000        Min.   :0.00000    
##  1st Qu.:0.0000          1st Qu.:0.0000        1st Qu.:0.00000    
##  Median :0.0000          Median :0.0000        Median :0.00000    
##  Mean   :0.1082          Mean   :0.3156        Mean   :0.05578    
##  3rd Qu.:0.0000          3rd Qu.:1.0000        3rd Qu.:0.00000    
##  Max.   :1.0000          Max.   :1.0000        Max.   :1.00000    
##  EducationFieldTechnical.Degree EnvironmentSatisfaction   GenderMale 
##  Min.   :0.0000                 Min.   :0.0000          Min.   :0.0  
##  1st Qu.:0.0000                 1st Qu.:0.3333          1st Qu.:0.0  
##  Median :0.0000                 Median :0.6667          Median :1.0  
##  Mean   :0.0898                 Mean   :0.5739          Mean   :0.6  
##  3rd Qu.:0.0000                 3rd Qu.:1.0000          3rd Qu.:1.0  
##  Max.   :1.0000                 Max.   :1.0000          Max.   :1.0  
##  JobRoleHuman.Resources JobRoleLaboratory.Technician JobRoleManager   
##  Min.   :0.00000        Min.   :0.0000               Min.   :0.00000  
##  1st Qu.:0.00000        1st Qu.:0.0000               1st Qu.:0.00000  
##  Median :0.00000        Median :0.0000               Median :0.00000  
##  Mean   :0.03537        Mean   :0.1762               Mean   :0.06939  
##  3rd Qu.:0.00000        3rd Qu.:0.0000               3rd Qu.:0.00000  
##  Max.   :1.00000        Max.   :1.0000               Max.   :1.00000  
##  JobRoleManufacturing.Director JobRoleResearch.Director
##  Min.   :0.00000               Min.   :0.00000         
##  1st Qu.:0.00000               1st Qu.:0.00000         
##  Median :0.00000               Median :0.00000         
##  Mean   :0.09864               Mean   :0.05442         
##  3rd Qu.:0.00000               3rd Qu.:0.00000         
##  Max.   :1.00000               Max.   :1.00000         
##  JobRoleResearch.Scientist JobRoleSales.Executive JobRoleSales.Representative
##  Min.   :0.0000            Min.   :0.0000         Min.   :0.00000            
##  1st Qu.:0.0000            1st Qu.:0.0000         1st Qu.:0.00000            
##  Median :0.0000            Median :0.0000         Median :0.00000            
##  Mean   :0.1986            Mean   :0.2218         Mean   :0.05646            
##  3rd Qu.:0.0000            3rd Qu.:0.0000         3rd Qu.:0.00000            
##  Max.   :1.0000            Max.   :1.0000         Max.   :1.00000            
##  MaritalStatusMarried MaritalStatusSingle NumCompaniesWorked
##  Min.   :0.0000       Min.   :0.0000      Min.   :0.0000    
##  1st Qu.:0.0000       1st Qu.:0.0000      1st Qu.:0.1111    
##  Median :0.0000       Median :0.0000      Median :0.2222    
##  Mean   :0.4578       Mean   :0.3197      Mean   :0.2992    
##  3rd Qu.:1.0000       3rd Qu.:1.0000      3rd Qu.:0.4444    
##  Max.   :1.0000       Max.   :1.0000      Max.   :1.0000    
##  RelationshipSatisfaction TotalWorkingYears WorkLifeBalance 
##  Min.   :0.0000           Min.   :0.000     Min.   :0.0000  
##  1st Qu.:0.3333           1st Qu.:0.150     1st Qu.:0.3333  
##  Median :0.6667           Median :0.250     Median :0.6667  
##  Mean   :0.5707           Mean   :0.282     Mean   :0.5871  
##  3rd Qu.:1.0000           3rd Qu.:0.375     3rd Qu.:0.6667  
##  Max.   :1.0000           Max.   :1.000     Max.   :1.0000

Test/Train

We create a test/train split for our data. We will build our models with the training data and evaluate them with the test data. We choose a ratio of \(0.5\) to ensure ample data for both testing and training. We apply the same split to our attrition and quality data sets. Additionally, we retain the un-normalized test rows so we can evaluate against the numeric Quality values later.

ratio <- 0.5
set.seed(122121)
trainRows <- sample(1:nrow(employee_n), ratio*nrow(employee_n))

employeeTrain <- employee_n[trainRows, ]
employeeTest <- employee_n[-trainRows, ]

employeeTestLabel <- employeeTest$Quality
employeeTrainLabel <- employeeTrain$Quality

employeeTestPredictors <- employeeTest[,-29]
employeeTrainPredictors <- employeeTrain[,-29]

attritionTrain <- attrition[trainRows, ]
attritionTest <- attrition[-trainRows, ]

attritionTestLabel <- attritionTest$AttritionYes
attritionTrainLabel <- attritionTrain$AttritionYes

attritionTestPredictors <- attritionTest[,-2]
attritionTrainPredictors <- attritionTrain[,-2]

quality <- employee[-trainRows,]
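A couple of quick sanity checks on the split (a sketch):

```r
# Train and test should partition the data exactly
nrow(employeeTrain) + nrow(employeeTest) == nrow(employee_n)  # TRUE
# Class balance should be similar across the two halves
mean(employeeTrainLabel)
mean(employeeTestLabel)
```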

Predicting Quality

Make GLM Model

We use our training data to build a GLM model for quality. Once again, this model only includes variables that can be observed prior to a hire. We then check how the model predicts on the test data; we achieve a Kappa of \(0.40\).

library(caret)
## Loading required package: lattice
GlmModel <- glm(Quality~., data=employeeTrain, family="binomial")
summary(GlmModel)
## 
## Call:
## glm(formula = Quality ~ ., family = "binomial", data = employeeTrain)
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -1.475e+00  2.885e+03  -0.001   0.9996    
## Age                               2.875e-01  5.649e-01   0.509   0.6108    
## BusinessTravelTravel_Frequently   3.785e-01  3.655e-01   1.036   0.3003    
## BusinessTravelTravel_Rarely       1.900e-01  3.162e-01   0.601   0.5480    
## DepartmentResearch...Development -8.904e-01  2.885e+03   0.000   0.9998    
## DepartmentSales                  -4.120e-01  2.971e+03   0.000   0.9999    
## DistanceFromHome                  2.008e-02  3.287e-01   0.061   0.9513    
## Education                         9.411e-03  3.906e-01   0.024   0.9808    
## EducationFieldLife.Sciences       9.548e-01  8.015e-01   1.191   0.2335    
## EducationFieldMarketing           6.621e-01  8.869e-01   0.747   0.4553    
## EducationFieldMedical             8.171e-01  8.021e-01   1.019   0.3083    
## EducationFieldOther               8.951e-01  8.689e-01   1.030   0.3030    
## EducationFieldTechnical.Degree    7.070e-01  8.304e-01   0.851   0.3945    
## EnvironmentSatisfaction           1.171e-02  2.516e-01   0.047   0.9629    
## GenderMale                        1.187e-01  1.985e-01   0.598   0.5500    
## JobRoleHuman.Resources            1.582e+00  2.885e+03   0.001   0.9996    
## JobRoleLaboratory.Technician      2.384e+00  4.575e-01   5.210 1.88e-07 ***
## JobRoleManager                   -1.608e+01  1.329e+03  -0.012   0.9903    
## JobRoleManufacturing.Director     8.881e-01  5.075e-01   1.750   0.0801 .  
## JobRoleResearch.Director         -1.608e+01  1.068e+03  -0.015   0.9880    
## JobRoleResearch.Scientist         2.661e+00  4.549e-01   5.850 4.91e-09 ***
## JobRoleSales.Executive            8.691e-02  2.010e+03   0.000   1.0000    
## JobRoleSales.Representative       2.622e+00  2.010e+03   0.001   0.9990    
## MaritalStatusMarried             -6.244e-01  2.434e-01  -2.566   0.0103 *  
## MaritalStatusSingle              -7.098e-01  2.578e-01  -2.753   0.0059 ** 
## NumCompaniesWorked               -7.341e-01  3.638e-01  -2.018   0.0436 *  
## RelationshipSatisfaction          3.091e-01  2.649e-01   1.167   0.2433    
## TotalWorkingYears                -1.686e+00  8.829e-01  -1.910   0.0561 .  
## WorkLifeBalance                  -1.921e-01  3.866e-01  -0.497   0.6193    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 965.47  on 734  degrees of freedom
## Residual deviance: 697.65  on 706  degrees of freedom
## AIC: 755.65
## 
## Number of Fisher Scoring iterations: 17
glmPred <- predict(GlmModel, newdata=employeeTest, type = "response")
glmBin <- ifelse(glmPred >= 0.5, 1, 0)
confusionMatrix(as.factor(glmBin), as.factor(employeeTest$Quality), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 380  87
##          1 112 156
##                                           
##                Accuracy : 0.7293          
##                  95% CI : (0.6956, 0.7611)
##     No Information Rate : 0.6694          
##     P-Value [Acc > NIR] : 0.0002658       
##                                           
##                   Kappa : 0.4038          
##                                           
##  Mcnemar's Test P-Value : 0.0888839       
##                                           
##             Sensitivity : 0.6420          
##             Specificity : 0.7724          
##          Pos Pred Value : 0.5821          
##          Neg Pred Value : 0.8137          
##              Prevalence : 0.3306          
##          Detection Rate : 0.2122          
##    Detection Prevalence : 0.3646          
##       Balanced Accuracy : 0.7072          
##                                           
##        'Positive' Class : 1               
## 
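Since every model in this report is judged by its Kappa, it may help to see how Cohen's Kappa falls out of a \(2\times2\) confusion table. The sketch below recomputes it in base R from the GLM counts above, as observed agreement corrected for agreement expected by chance:

```r
# Counts from the GLM confusion matrix above (rows = Prediction, cols = Reference)
tab <- matrix(c(380, 112, 87, 156), nrow = 2,
              dimnames = list(Prediction = c("0", "1"),
                              Reference  = c("0", "1")))

n  <- sum(tab)
po <- sum(diag(tab)) / n                       # observed agreement (accuracy)
pe <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
kappa <- (po - pe) / (1 - pe)
round(kappa, 4)  # 0.4038, matching the value reported by confusionMatrix()
```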

Make KNN Model

We build our KNN model for employee quality and save our predictions. KNN takes a test data point and finds the points in the train data that are closest to it, then uses a majority vote among those neighbors to assign the label (\(1\) or \(0\)). Here \(k\) is the number of nearest training points included in the vote. We tried many \(k\)'s and determined that a \(k\) of \(13\) works best on this data. The usual rule of thumb suggests \(k \approx \sqrt{735} \approx 27\), but that gives a much worse model, likely due to the small number of predictors. We then evaluate how our model performs on the test data. We see that we have a Kappa of \(0.40\).

library(class)
KnnModel <- knn(train = employeeTrainPredictors, test = employeeTestPredictors, cl = employeeTrainLabel, k = 13)
confusionMatrix(as.factor(KnnModel), as.factor(employeeTest$Quality), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 392  97
##          1 100 146
##                                           
##                Accuracy : 0.732           
##                  95% CI : (0.6984, 0.7637)
##     No Information Rate : 0.6694          
##     P-Value [Acc > NIR] : 0.0001433       
##                                           
##                   Kappa : 0.3963          
##                                           
##  Mcnemar's Test P-Value : 0.8866897       
##                                           
##             Sensitivity : 0.6008          
##             Specificity : 0.7967          
##          Pos Pred Value : 0.5935          
##          Neg Pred Value : 0.8016          
##              Prevalence : 0.3306          
##          Detection Rate : 0.1986          
##    Detection Prevalence : 0.3347          
##       Balanced Accuracy : 0.6988          
##                                           
##        'Positive' Class : 1               
## 
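The \(k\) search described above is not shown in the report. A minimal sketch of the kind of loop we mean follows, run on synthetic data (the variable names, the synthetic predictors, and the candidate grid are illustrative, not the exact tuning code we used):

```r
library(class)  # for knn()
set.seed(1)

# Synthetic stand-in data (the real tuning used the employee predictors)
n <- 400
x <- data.frame(a = runif(n), b = runif(n))
y <- factor(ifelse(x$a + x$b + rnorm(n, sd = 0.3) > 1, 1, 0))
trainIdx <- sample(n, n / 2)

# Cohen's Kappa from a confusion table
kappaOf <- function(pred, truth) {
  tab <- table(pred, truth)
  po  <- sum(diag(tab)) / sum(tab)
  pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
  (po - pe) / (1 - pe)
}

# Score each candidate k by its Kappa on the held-out half
ks <- seq(1, 27, by = 2)
kappas <- sapply(ks, function(k) {
  pred <- knn(train = x[trainIdx, ], test = x[-trainIdx, ],
              cl = y[trainIdx], k = k)
  kappaOf(pred, y[-trainIdx])
})
ks[which.max(kappas)]  # the k we would carry forward
```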

Make ANN Model

We make our Neural Network model to predict quality. This model is loosely inspired by the human brain: in each layer, every neuron is connected to every neuron in the previous layer. We use \(5\) hidden layers of \(60\), \(30\), \(10\), \(6\), and \(4\) neurons, for a total of \(110\). We increase our learning-rate factor and threshold so that training finishes in a reasonable time. We then make predictions on our test data. We will not save binary predictions, because we will allow our decision tree to find a good threshold; however, we binarize at \(0.5\) here to evaluate the model. We see that we achieve a Kappa of \(0.43\).

library(neuralnet)
set.seed(422)


annmodel <- neuralnet(Quality ~ ., data = employeeTrain, hidden = c(60, 30, 10,6,4), threshold = 5,
  stepmax = 1e+05, rep = 1, startweights = NULL,
  learningrate.limit = NULL, learningrate.factor = list(minus = 0.5,
  plus = 1.2), learningrate = NULL, lifesign = "none",
  lifesign.step = 1000, algorithm = "rprop+", err.fct = "sse",
  act.fct = "logistic", linear.output = TRUE, exclude = NULL,
  constant.weights = NULL, likelihood = FALSE)
library(caret)
annPred <- predict(annmodel, employeeTest)
annBin <- ifelse(annPred >= 0.5, 1, 0)
confusionMatrix(as.factor(annBin), as.factor(employeeTest$Quality), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 380  81
##          1 112 162
##                                          
##                Accuracy : 0.7374         
##                  95% CI : (0.704, 0.7689)
##     No Information Rate : 0.6694         
##     P-Value [Acc > NIR] : 3.853e-05      
##                                          
##                   Kappa : 0.4253         
##                                          
##  Mcnemar's Test P-Value : 0.03082        
##                                          
##             Sensitivity : 0.6667         
##             Specificity : 0.7724         
##          Pos Pred Value : 0.5912         
##          Neg Pred Value : 0.8243         
##              Prevalence : 0.3306         
##          Detection Rate : 0.2204         
##    Detection Prevalence : 0.3728         
##       Balanced Accuracy : 0.7195         
##                                          
##        'Positive' Class : 1              
## 

Make SVM Model

We make our SVM model and make binary predictions on quality. We try several different kernels and evaluate the Kappa of each model on the test data, saving the model with the best Kappa as the SVM model that feeds into our decision tree. After running the loop, we see that the vanilla (linear) kernel performs best, so we save that model. It has a Kappa of \(0.48\).

library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(caret)

kernels <- c("vanilladot", "rbfdot", "polydot", "tanhdot",
             "laplacedot", "besseldot", "anovadot", "splinedot")
best_kappa <- -Inf
best_model <- NULL
best_predictions <- NULL

for (kernel in kernels) {
  classifier <- ksvm(factor(Quality) ~ ., data = employeeTrain, kernel = kernel)
  predictions <- predict(classifier, employeeTest)
  predictions <- as.factor(predictions)
  cm <- confusionMatrix(as.factor(predictions), as.factor(employeeTest$Quality), positive = "1")
  kappa_value <- cm$overall["Kappa"]
  
  if (kappa_value > best_kappa) {
    best_kappa <- kappa_value
    best_model <- kernel
    best_predictions <- predictions
  }
}
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters
# Save the predictions of the best model to a dataframe
svmPred <- data.frame(Predictions = as.character(best_predictions))
svmPredictions <- as.factor(svmPred$Predictions)

# Print the best model and its kappa
cat("Best Model:", best_model, "- Best Kappa:", best_kappa, "\n")
## Best Model: vanilladot - Best Kappa: 0.4758871

Decision Tree Model

Now we make our basic decision tree model for quality. The tree repeatedly splits the data with binary decisions on different variables, and each leaf assigns \(1\) or \(0\). We will feed these predictions into our larger decision tree model, but first we evaluate them on their own. We see that we have a Kappa of \(0.39\).

library(C50)
dt <- C5.0(as.factor(Quality) ~., data = employeeTrain)
plot(dt)

dtpredict <- predict(dt, employeeTest)
confusionMatrix(as.factor(dtpredict), as.factor(employeeTest$Quality), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 402 104
##          1  90 139
##                                           
##                Accuracy : 0.7361          
##                  95% CI : (0.7026, 0.7676)
##     No Information Rate : 0.6694          
##     P-Value [Acc > NIR] : 5.404e-05       
##                                           
##                   Kappa : 0.3948          
##                                           
##  Mcnemar's Test P-Value : 0.3506          
##                                           
##             Sensitivity : 0.5720          
##             Specificity : 0.8171          
##          Pos Pred Value : 0.6070          
##          Neg Pred Value : 0.7945          
##              Prevalence : 0.3306          
##          Detection Rate : 0.1891          
##    Detection Prevalence : 0.3116          
##       Balanced Accuracy : 0.6945          
##                                           
##        'Positive' Class : 1               
## 

Making a Data Frame of All of Our Models

We combine all of our previous models' predictions into a single data frame for our employee quality predictions. We also include the true quality label in the data set so that we are able to train our stacked model. Where a model produces non-binary predictions, we store those raw values, as we want our final decision tree to set the thresholds for us. We will build our final model later.

employeeModels <- data.frame(dtpredict, annPred, svmPredictions, KnnModel, glmPred, employeeTest$Quality)
str(employeeModels)
## 'data.frame':    735 obs. of  6 variables:
##  $ dtpredict           : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 1 1 ...
##  $ annPred             : num  0.519 0.671 0.575 0.356 0.337 ...
##  $ svmPredictions      : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 1 2 1 ...
##  $ KnnModel            : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 2 1 2 1 ...
##  $ glmPred             : num  0.462 0.687 0.663 0.45 0.266 ...
##  $ employeeTest.Quality: num  1 1 1 1 0 0 1 0 1 0 ...

Predicting Attrition

Make GLM Model

We now use our train data to make a GLM model for predicting attrition. We have a Kappa of \(0.23\), an early indication that our attrition models are less powerful than our quality models.

library(caret)
GlmModel <- glm(AttritionYes~., data=attritionTrain, family="binomial")
summary(GlmModel)
## 
## Call:
## glm(formula = AttritionYes ~ ., family = "binomial", data = attritionTrain)
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -15.0533   563.8412  -0.027  0.97870    
## Age                               -0.3488     0.6730  -0.518  0.60427    
## BusinessTravelTravel_Frequently    1.1585     0.4425   2.618  0.00884 ** 
## BusinessTravelTravel_Rarely        0.4902     0.3977   1.233  0.21769    
## DepartmentResearch...Development  14.5732   563.8414   0.026  0.97938    
## DepartmentSales                   13.4705   563.8418   0.024  0.98094    
## DistanceFromHome                   1.5064     0.3543   4.252 2.12e-05 ***
## Education                          0.1602     0.4556   0.352  0.72510    
## EducationFieldLife.Sciences       -2.5083     1.0772  -2.329  0.01988 *  
## EducationFieldMarketing           -1.8959     1.1242  -1.687  0.09170 .  
## EducationFieldMedical             -2.6981     1.0751  -2.510  0.01209 *  
## EducationFieldOther               -2.7292     1.1594  -2.354  0.01858 *  
## EducationFieldTechnical.Degree    -1.8200     1.0832  -1.680  0.09294 .  
## EnvironmentSatisfaction           -0.8173     0.2962  -2.759  0.00579 ** 
## GenderMale                         0.2248     0.2338   0.962  0.33623    
## JobRoleHuman.Resources            14.5066   563.8411   0.026  0.97947    
## JobRoleLaboratory.Technician       0.9750     0.5226   1.866  0.06206 .  
## JobRoleManager                     0.8864     0.9116   0.972  0.33089    
## JobRoleManufacturing.Director     -0.1167     0.6672  -0.175  0.86120    
## JobRoleResearch.Director          -0.7997     1.1388  -0.702  0.48258    
## JobRoleResearch.Scientist          0.5470     0.5243   1.043  0.29682    
## JobRoleSales.Executive             1.7125     1.4095   1.215  0.22436    
## JobRoleSales.Representative        3.1210     1.4600   2.138  0.03254 *  
## MaritalStatusMarried               0.6628     0.3381   1.960  0.04995 *  
## MaritalStatusSingle                1.6590     0.3388   4.897 9.73e-07 ***
## NumCompaniesWorked                 1.4358     0.3981   3.607  0.00031 ***
## RelationshipSatisfaction          -0.7563     0.3035  -2.492  0.01272 *  
## TotalWorkingYears                 -2.0314     1.0776  -1.885  0.05942 .  
## WorkLifeBalance                   -0.8707     0.4505  -1.933  0.05328 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 704.04  on 734  degrees of freedom
## Residual deviance: 550.47  on 706  degrees of freedom
## AIC: 608.47
## 
## Number of Fisher Scoring iterations: 14
glmPred <- predict(GlmModel, newdata=attritionTest, type = "response")
glmBin <- ifelse(glmPred >= 0.5, 1, 0)
confusionMatrix(as.factor(glmBin), as.factor(attritionTest$AttritionYes), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 605  78
##          1  29  23
##                                           
##                Accuracy : 0.8544          
##                  95% CI : (0.8268, 0.8791)
##     No Information Rate : 0.8626          
##     P-Value [Acc > NIR] : 0.759           
##                                           
##                   Kappa : 0.2286          
##                                           
##  Mcnemar's Test P-Value : 3.478e-06       
##                                           
##             Sensitivity : 0.22772         
##             Specificity : 0.95426         
##          Pos Pred Value : 0.44231         
##          Neg Pred Value : 0.88580         
##              Prevalence : 0.13741         
##          Detection Rate : 0.03129         
##    Detection Prevalence : 0.07075         
##       Balanced Accuracy : 0.59099         
##                                           
##        'Positive' Class : 1               
## 

Make KNN Model

We build our KNN model for attrition. This time, we find that a \(k\) of \(6\) works best. Once again, our model performs worse than it did for quality with a Kappa of \(0.19\).

library(class)
set.seed(8)
KnnModel <- knn(train = attritionTrainPredictors, test = attritionTestPredictors, cl = attritionTrainLabel, k = 6)
confusionMatrix(as.factor(KnnModel), as.factor(attritionTest$AttritionYes), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 607  82
##          1  27  19
##                                           
##                Accuracy : 0.8517          
##                  95% CI : (0.8239, 0.8766)
##     No Information Rate : 0.8626          
##     P-Value [Acc > NIR] : 0.8194          
##                                           
##                   Kappa : 0.1887          
##                                           
##  Mcnemar's Test P-Value : 2.313e-07       
##                                           
##             Sensitivity : 0.18812         
##             Specificity : 0.95741         
##          Pos Pred Value : 0.41304         
##          Neg Pred Value : 0.88099         
##              Prevalence : 0.13741         
##          Detection Rate : 0.02585         
##    Detection Prevalence : 0.06259         
##       Balanced Accuracy : 0.57277         
##                                           
##        'Positive' Class : 1               
## 

Make ANN Model

We make our Neural Network model for predicting attrition. We once again use \(5\) hidden layers of \(60\), \(30\), \(10\), \(6\), and \(4\) neurons, for a total of \(110\). We have a final Kappa of \(0.23\), once again worse than the quality prediction.

library(neuralnet)
set.seed(422)


annmodel <- neuralnet(AttritionYes ~ ., data = attritionTrain, hidden = c(60, 30, 10,6,4), threshold = 2,
  stepmax = 1e+05, rep = 1, startweights = NULL,
  learningrate.limit = NULL, learningrate.factor = list(minus = 0.5,
  plus = 1.2), learningrate = NULL, lifesign = "none",
  lifesign.step = 1000, algorithm = "rprop+", err.fct = "sse",
  act.fct = "logistic", linear.output = TRUE, exclude = NULL,
  constant.weights = NULL, likelihood = FALSE)
library(caret)
annPred <- predict(annmodel, attritionTest)
annBin <- ifelse(annPred >= 0.5, 1, 0)
confusionMatrix(as.factor(annBin), as.factor(attritionTest$AttritionYes), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 591  74
##          1  43  27
##                                           
##                Accuracy : 0.8408          
##                  95% CI : (0.8123, 0.8665)
##     No Information Rate : 0.8626          
##     P-Value [Acc > NIR] : 0.959289        
##                                           
##                   Kappa : 0.2291          
##                                           
##  Mcnemar's Test P-Value : 0.005546        
##                                           
##             Sensitivity : 0.26733         
##             Specificity : 0.93218         
##          Pos Pred Value : 0.38571         
##          Neg Pred Value : 0.88872         
##              Prevalence : 0.13741         
##          Detection Rate : 0.03673         
##    Detection Prevalence : 0.09524         
##       Balanced Accuracy : 0.59975         
##                                           
##        'Positive' Class : 1               
## 

Make SVM Model

We now make an SVM model for attrition. We once again try many different kernels, with the anova kernel performing best this time. It has a Kappa of \(0.20\), once again much lower than our Kappa for quality.

library(kernlab)
library(caret)

kernels <- c("vanilladot", "rbfdot", "polydot", "tanhdot", "laplacedot", "besseldot", "anovadot", "splinedot")
best_kappa <- -Inf
best_model <- NULL
best_predictions <- NULL

for (kernel in kernels) {
  classifier <- ksvm(factor(AttritionYes) ~ ., data = attritionTrain, kernel = kernel)
  predictions <- predict(classifier, attritionTest)
  predictions <- as.factor(predictions)
  cm <- confusionMatrix(as.factor(predictions), as.factor(attritionTest$AttritionYes), positive = "1")
  kappa_value <- cm$overall["Kappa"]
  
  if (kappa_value > best_kappa) {
    best_kappa <- kappa_value
    best_model <- kernel
    best_predictions <- predictions
  }
}
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters  
##  Setting default kernel parameters
# Save the predictions of the best model to a dataframe
svmPred <- data.frame(Predictions = as.character(best_predictions))
svmPredictions <- as.factor(svmPred$Predictions)

# Print the best model and its kappa
cat("Best Model:", best_model, "- Best Kappa:", best_kappa, "\n")
## Best Model: anovadot - Best Kappa: 0.1987956

Decision Tree Model

Now we make our basic decision tree model for attrition. We see that we have a Kappa of \(0.20\); as with the other attrition models, this signifies less predictive power than our quality model.

library(C50)
dt <- C5.0(as.factor(AttritionYes) ~., data = attritionTrain)
plot(dt)

dtpredict <- predict(dt, attritionTest)
confusionMatrix(as.factor(dtpredict), as.factor(attritionTest$AttritionYes), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 589  76
##          1  45  25
##                                           
##                Accuracy : 0.8354          
##                  95% CI : (0.8065, 0.8615)
##     No Information Rate : 0.8626          
##     P-Value [Acc > NIR] : 0.984274        
##                                           
##                   Kappa : 0.2027          
##                                           
##  Mcnemar's Test P-Value : 0.006386        
##                                           
##             Sensitivity : 0.24752         
##             Specificity : 0.92902         
##          Pos Pred Value : 0.35714         
##          Neg Pred Value : 0.88571         
##              Prevalence : 0.13741         
##          Detection Rate : 0.03401         
##    Detection Prevalence : 0.09524         
##       Balanced Accuracy : 0.58827         
##                                           
##        'Positive' Class : 1               
## 

Making a Data Frame of All of Our Models

We combine all of our previous models' predictions into a single data frame for attrition predictions. We also include the response variable, attrition, in the data set so that we are able to train our stacked model. Where a model produces non-binary predictions, we store those raw values, as we want our final decision tree to set the thresholds for us.

attritionModels <- data.frame(dtpredict, annPred, svmPredictions, KnnModel, glmPred, attritionTest$AttritionYes)
str(attritionModels)
## 'data.frame':    735 obs. of  6 variables:
##  $ dtpredict                 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ annPred                   : num  0.0499 -0.0227 0.4672 0.0863 0.1527 ...
##  $ svmPredictions            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ KnnModel                  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ glmPred                   : num  0.274 0.065 0.267 0.118 0.245 ...
##  $ attritionTest.AttritionYes: num  1 0 0 0 0 0 1 0 0 0 ...

Making Final Model

Break Data Frame into Test/Train

We now break our final data frame into test and train with a \(0.7/0.3\) ratio. This will allow us to train our stacked decision trees and test them. We split our data for both attrition and quality. Additionally, we give the same split to our numerical quality variable for later evaluation.

ratio <- 0.7
set.seed(69)
trainRowsFinal <- sample(1:nrow(employeeModels), ratio*nrow(employeeModels))
employeeTrain <- employeeModels[trainRowsFinal, ]
employeeTest <- employeeModels[-trainRowsFinal, ]

attritionTrain <- attritionModels[trainRowsFinal,]
attritionTest <- attritionModels[-trainRowsFinal,]

quality <- quality[-trainRowsFinal,]
quality <- quality$Quality

Quality Final Tree with Cost Matrix

Now we make our stacked decision tree for quality, using different costs for false positives and false negatives: a cost of \(1\) for false negatives and \(1.25\) for false positives. The standard setting for a decision tree assigns a cost of \(1\) to both. In this way, we penalize false positives more heavily, since we want to ensure that we are hiring quality workers.

# cost 1.25 for false positives, 1 for false negatives
cost_matrix <- matrix(c(0,1.25,1,0), nrow = 2) 
finalDt <- C5.0(as.factor(employeeTest.Quality) ~., data = employeeTrain, costs = cost_matrix)
## Warning: no dimnames were given for the cost matrix; the factor levels will be
## used
plot(finalDt)

employeepredict <- predict(finalDt, employeeTest)
confusionMatrix(as.factor(employeepredict), as.factor(employeeTest$employeeTest.Quality), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 129  26
##          1  22  44
##                                           
##                Accuracy : 0.7828          
##                  95% CI : (0.7226, 0.8353)
##     No Information Rate : 0.6833          
##     P-Value [Acc > NIR] : 0.0006775       
##                                           
##                   Kappa : 0.4904          
##                                           
##  Mcnemar's Test P-Value : 0.6650055       
##                                           
##             Sensitivity : 0.6286          
##             Specificity : 0.8543          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.8323          
##              Prevalence : 0.3167          
##          Detection Rate : 0.1991          
##    Detection Prevalence : 0.2986          
##       Balanced Accuracy : 0.7414          
##                                           
##        'Positive' Class : 1               
## 

Our stacked model performed better than any of the base models, with a Kappa of \(0.49\). Also, our cost matrix led to fewer false positives, meaning we are mostly selecting quality workers.
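As an aside, the "no dimnames were given for the cost matrix" warning printed when we fit the tree can be avoided by labeling the cost matrix explicitly. A sketch of the construction is below; whether rows index the predicted or the true class should be confirmed against the C50 documentation for your version, so the orientation shown here is an assumption:

```r
# A labeled 2x2 cost matrix; supplying dimnames avoids the
# "no dimnames were given" warning from C5.0(). The row/column
# orientation (predicted vs. actual) is an assumption to be
# checked against the C50 documentation.
cost_matrix <- matrix(c(0, 1.25, 1, 0), nrow = 2,
                      dimnames = list(predicted = c("0", "1"),
                                      actual    = c("0", "1")))
cost_matrix
# finalDt <- C5.0(as.factor(employeeTest.Quality) ~ ., data = employeeTrain,
#                 costs = cost_matrix)
```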

Attrition Final Tree with Cost Matrix

It is paramount that we do not hire workers who will quit shortly after being hired. Therefore we assign a cost of \(5\) to false negatives and only \(1\) to false positives. This way, nearly everyone we predict will not quit really will not quit.

# cost 1 for false positives, 5 for false negatives
cost_matrix <- matrix(c(0,1,5,0), nrow = 2) 
finalDt <- C5.0(as.factor(attritionTest.AttritionYes) ~., data = attritionTrain, costs = cost_matrix)
## Warning: no dimnames were given for the cost matrix; the factor levels will be
## used
plot(finalDt)

attritionpredict <- predict(finalDt, attritionTest)
confusionMatrix(as.factor(attritionpredict), as.factor(attritionTest$attritionTest.AttritionYes), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 144  11
##          1  39  27
##                                           
##                Accuracy : 0.7738          
##                  95% CI : (0.7128, 0.8272)
##     No Information Rate : 0.8281          
##     P-Value [Acc > NIR] : 0.9846951       
##                                           
##                   Kappa : 0.385           
##                                           
##  Mcnemar's Test P-Value : 0.0001343       
##                                           
##             Sensitivity : 0.7105          
##             Specificity : 0.7869          
##          Pos Pred Value : 0.4091          
##          Neg Pred Value : 0.9290          
##              Prevalence : 0.1719          
##          Detection Rate : 0.1222          
##    Detection Prevalence : 0.2986          
##       Balanced Accuracy : 0.7487          
##                                           
##        'Positive' Class : 1               
## 

We have a Kappa of about \(0.39\), significantly better than all of our base models, which shows the power of the cost matrix. Additionally, we have an extremely small number of false negatives, as intended. This shows that our model performed very well for our goal.

Plot all of our Kappas

We plot the Kappas of all our models to compare their performance. Notice the massive improvement in our attrition Kappa with our ultimate, stacked model.

library(ggplot2)
library(tidyr)

# Kappa values collected from the models above
Quality <- c(0.40, 0.40, 0.43, 0.48, 0.39, 0.49)
Attrition <- c(0.23, 0.19, 0.23, 0.20, 0.20, 0.39)
Models <- c("GLM", "KNN", "ANN", "SVM", "DT", "Ultimate")
data <- data.frame(Models, Quality, Attrition)

# Reshape the data
melted_data <- data %>%
  pivot_longer(cols = c(Quality, Attrition), names_to = "Variable", values_to = "Kappa")

# Plot
ggplot(melted_data, aes(x = Models, y = Kappa, fill = Variable)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8), color = "darkgrey") +
  scale_fill_manual(values = c("black", "white")) +  # Set colors manually
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Make Final Predictions

We decide to hire someone if we predict that they are both a quality worker and that they will not quit. We call this "doHire." We make confusion matrices to see how many of our hires are quality and how many will eventually quit. We see that we chose to hire only \(4\) workers who will eventually quit, less than \(15\%\) of our hires.
We also check how accurate our model is at choosing whom to hire. We compute the actual quality and attrition values, apply the same hiring criterion, and compare the results to our predictions. We did quite well, with \(18\) hires being good and only \(15\) being poor. Given all of the variables we had deleted, this is a strong performance. We have a final Kappa of \(0.32\).

doHire <- ifelse(employeepredict == 1 & attritionpredict == 0, 1, 0)

conf <- confusionMatrix(as.factor(doHire), as.factor(attritionTest$attritionTest.AttritionYes), positive = "1")

confusionMatrix(as.factor(doHire), as.factor(employeeTest$employeeTest.Quality), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 139  49
##          1  12  21
##                                         
##                Accuracy : 0.724         
##                  95% CI : (0.66, 0.7818)
##     No Information Rate : 0.6833        
##     P-Value [Acc > NIR] : 0.1086        
##                                         
##                   Kappa : 0.257         
##                                         
##  Mcnemar's Test P-Value : 4.04e-06      
##                                         
##             Sensitivity : 0.30000       
##             Specificity : 0.92053       
##          Pos Pred Value : 0.63636       
##          Neg Pred Value : 0.73936       
##              Prevalence : 0.31674       
##          Detection Rate : 0.09502       
##    Detection Prevalence : 0.14932       
##       Balanced Accuracy : 0.61026       
##                                         
##        'Positive' Class : 1             
## 
# The ideal hire: actually a quality worker who never quits
shouldHire <- ifelse(employeeTest$employeeTest.Quality == 1 & attritionTest$attritionTest.AttritionYes == 0, 1, 0)

conf_matrix <- confusionMatrix(as.factor(doHire), as.factor(shouldHire), positive = "1")

print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 157  31
##          1  15  18
##                                           
##                Accuracy : 0.7919          
##                  95% CI : (0.7323, 0.8434)
##     No Information Rate : 0.7783          
##     P-Value [Acc > NIR] : 0.34751         
##                                           
##                   Kappa : 0.3172          
##                                           
##  Mcnemar's Test P-Value : 0.02699         
##                                           
##             Sensitivity : 0.36735         
##             Specificity : 0.91279         
##          Pos Pred Value : 0.54545         
##          Neg Pred Value : 0.83511         
##              Prevalence : 0.22172         
##          Detection Rate : 0.08145         
##    Detection Prevalence : 0.14932         
##       Balanced Accuracy : 0.64007         
##                                           
##        'Positive' Class : 1               
## 
conf_matrix_df <- as.data.frame(as.matrix(conf_matrix$table))

conf_df <- as.data.frame(as.matrix(conf$table))


ggplot(data = conf_matrix_df, aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Freq), colour = "black") +  
  # Change the color of borders to black
  geom_text(aes(label = Freq), color = "black", family = "Arial") + 
  # Change text color to black and font family to Arial
  theme_minimal() +
  scale_fill_gradient(low = "white", high = "darkgrey") +
  labs(x = "Ideal Hire Choice", y = "Model's Hire Choice",
       title = "                       Confusion Matrix of Hires") +
  theme(axis.text = element_text(size = 12, family = "Times New Roman"),
        # Change font family for axis text
        axis.title = element_text(size = 14, face = "bold", family = "Times New Roman"),  
        # Change font family for axis titles
        plot.title = element_text(size = 16, face = "bold", family = "Times New Roman"))

ggplot(data = conf_df, aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Freq), colour = "black") +  
  # Change the color of borders to black
  geom_text(aes(label = Freq), color = "black", family = "Arial") +  
  # Change text color to black and font family to Arial
  theme_minimal() +
  scale_fill_gradient(low = "white", high = "darkgrey") +
  labs(x = "Hire would Quit", y = "Recommended Hire",
       title = "         Recommended Hire Compared to Attrition") +
  theme(axis.text = element_text(size = 12, family = "Times New Roman"),  
        # Change font family for axis text
        axis.title = element_text(size = 14, face = "bold", family = "Times New Roman"),  
        # Change font family for axis titles
        plot.title = element_text(size = 16, face = "bold", family = "Times New Roman"))

Finding our Final Profit

We kept our numerical quality from earlier for a reason: we want to evaluate our total final profit. First, we look at a histogram of the quality of our recommended hires. Notice that even those below \(50\), which we rate as “Not Quality,” are still above \(0\) or not far below it. This is a very good sign. Finally, we remove the \(4\) workers who will quit from our data set and add up the quality of the remaining hires. This gives us our final increase in dollars per day. We have a net profit of about \(2,500\) dollars. This means we have \[\frac{2,500}{33} \approx 75 \text{ dollars per hire}\]

qualityOfHires <- data.frame(quality, doHire, attritionTest$attritionTest.AttritionYes)

subset <- qualityOfHires[qualityOfHires$doHire == 1,]

hist(subset$quality, 
     main = "Employee Quality of Recommended Hires", 
     col = 'black', 
     border = 'white',
     xlab = "$ Added per Day",
     breaks = seq(-50, 250, by = 25))

# Drop the recommended hires who would eventually quit
subset <- subset[subset$attritionTest.attritionTest.AttritionYes == 0,]

sum(subset$quality)
## [1] 2518.822
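The per-hire figure quoted above follows directly from this sum; a minimal sketch, assuming the `doHire` and `subset` objects from the chunks above are still in scope:

```r
# Total added profit from retained recommended hires
totalProfit <- sum(subset$quality)      # about 2518.82 dollars per day

# Divide by all recommended hires (33 in our test set), since even the
# eventual quitters were hired and trained
perHire <- totalProfit / sum(doHire)    # roughly 75 dollars per hire per day
```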

Conclusions

Our model was largely successful at predicting whom we should hire based on our metrics. This is made evident by the average profit per day per worker of \(75\) dollars. This number rests on our arbitrary assumptions, but it is such a massive increase from the initial \(-16\) dollars that its power cannot be denied. And at any rate, we are increasing variables that are certainly correlated with improved job performance. The result is meaningful because we initially deleted any variables highly correlated with the variables we were predicting. The only variables remaining were those giving basic information on a candidate, such as education and distance from home. This ensures that our model can be used when evaluating new hires.

With the power of the cost matrix, we were able to almost entirely avoid workers who will eventually quit, which should lead to much greater long-term profits. Our recommended hires are both high quality and likely to be loyal. This is what produced the roughly \(75\) extra dollars per day per worker among our recommended hires.

With that said, our model was not perfect. Our model did recommend hiring some individuals who would not lead to increased profits. Due to this, we recommend that our model be used in conjunction with other hiring methods. This may include typical resume drops or interviews. In this way, with a holistic approach, IBM can make the most profitable hires possible.

Expanding on that previous note, our current model assumes it is the sole method for hiring. The results are good, but they can be improved upon with a holistic approach: using our model to screen candidates for interviews rather than to make final decisions. Screening would change the cost matrices entirely. We currently have \((0, 5, 4, 0)\) for quality; if we were screening, we might relax this to \((0,1,1,0)\). Likewise, we might change our attrition cost matrix from \((0,1,5,0)\) to \((0,1,2,0)\), since the cost of advancing a borderline candidate to an interview is far lower than the cost of hiring one outright.
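The proposed screening matrices could be written out as below. This is a sketch only: it assumes the same \(2 \times 2\) predicted-by-actual layout (with zero cost on the diagonal) as the matrices we built earlier, and R's default column-major fill order for the four entries.

```r
# Hypothetical cost matrices for a screening (rather than final-hire) model.
# Rows are predicted class, columns are actual class; diagonal entries are
# zero because correct predictions cost nothing.
qualityScreenCosts <- matrix(c(0, 1, 1, 0), nrow = 2,
                             dimnames = list(predicted = c("0", "1"),
                                             actual    = c("0", "1")))

attritionScreenCosts <- matrix(c(0, 1, 2, 0), nrow = 2,
                               dimnames = list(predicted = c("0", "1"),
                                               actual    = c("0", "1")))
```

The attrition matrix keeps a mild asymmetry: wrongly screening in a future quitter still wastes an interview slot, but far less money than a bad hire, so its penalty drops from \(5\) to \(2\).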